Which chemical properties influence the quality of red wines? In this project we’ll try to answer this question by exploring the red wine data set.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Some initial observations here:
quality is an ordered, categorical, discrete variable. The median rating is 6 on a 10-point scale, and 75% of wines are rated 6 or below.
density appears to have a small amount of variance, while there is much more variance in residual.sugar and chlorides.
The minimum citric.acid value is 0.
Now let’s look at the distributions of the variables.
Some observations on these:
volatile.acidity, density and pH look nearly normal.
residual.sugar and chlorides have extremely long tails.
citric.acid appears to have a large number of zero values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
There is a high concentration of residual sugar values around 2.2 (the median), with some outliers in the higher ranges.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
We see a similar distribution with chlorides. It peaks at around 0.079 (the median).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Number of zero-values:
## [1] 132
This is really a strange distribution: about 8% (132/1599) of the wines contain no citric acid at all.
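The zero count and the percentage can be reproduced with a couple of lines of base R. This is only a sketch: the vector holds the first ten citric.acid values copied from the str() output above, not the full column, and the variable names are my own.

```r
# First ten citric.acid values from the str() output above;
# the full column has 1599 values.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Count the exact zeros and express them as a share of the sample.
n_zero <- sum(citric_acid == 0)
share_zero <- mean(citric_acid == 0)

# The same arithmetic for the full data set, using the counts reported above.
full_share <- 132 / 1599
round(100 * full_share, 1)
```

On the full column, the same `sum(x == 0)` call would be expected to give the 132 reported above.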
There are 1599 observations of 13 variables in the red_wine data set.
I’m most interested in quality and in how the other variables affect it. Quality is scored between 0 and 10, but the observations only range from a minimum of 3 to a maximum of 8, with a mean of 5.636.
I won’t be sure until I look at correlations between variables and some bivariate plots, but volatile.acidity, citric.acid and alcohol seem to be features that relate to the taste of wine.
Not yet.
Some variables, like residual.sugar and chlorides, have long-tailed distributions. I also noticed that 8% of citric.acid values are zero.
I haven’t performed any operations yet.
Quantitatively, the following variables have relatively strong linear relationships with quality:
Strong linear relationships between other variables:
Let’s see more details.
Among all features alcohol has the strongest correlation with red wine quality (0.476).
The wines rated as 3 all have alcohol values less than or equal to 11%, while roughly 75% of wines rated as 7 or 8 have alcohol values greater than 11%.
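Both numbers above come from simple computations. The sketch below assumes the coefficient is a Pearson correlation (the default of cor()) and uses only the first ten rows shown in the str() output, so it will not reproduce 0.476 or the 75% figure; it only shows the calls.

```r
# First ten rows of alcohol and quality from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Pearson correlation between alcohol and quality.
r <- cor(alcohol, quality)

# Share of wines above 11% alcohol within each quality rating.
# The logical vector is averaged per group, which gives a proportion.
share_above_11 <- tapply(alcohol > 11, quality, mean)
share_above_11
```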
With all six quality levels, the plots start looking messy. I created a categorical variable rating, classifying the wines as low (rating 0 to 4), medium (rating 5 and 6), and high (rating 7 to 10).
## low medium high
## 63 1319 217
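One way to build such a variable is cut(). This is a sketch under the assumption that the bins are exactly the ones described above; it uses the first ten quality values from the str() output, so the counts differ from the full table.

```r
# First ten quality scores from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Bin the scores: [0, 4] -> low, (4, 6] -> medium, (6, 10] -> high.
# include.lowest = TRUE keeps a score of 0 inside the first bin.
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("low", "medium", "high"),
              include.lowest = TRUE,
              ordered_result = TRUE)

table(rating)
```

For these ten rows the table has 0 low, 8 medium and 2 high wines. Making the factor ordered keeps low < medium < high in plots and comparisons.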
We see that low and medium quality wines become less common as alcohol levels increase, while at higher alcohol levels there are more high quality wines.
There is a clear positive relationship between alcohol and quality. It makes sense since higher alcohol content would be related to a higher concentration of flavor. Lower concentrations of alcohol would likely have more of a “watery” mouthfeel in comparison and might not be perceived as being of high quality.
Volatile acidity has the second strongest correlation with quality, and it is negative (-0.391).
I added jitter and transparency to prevent overplotting. It definitely looks like there is a negative correlation between the two.
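Jitter and transparency can be sketched in base R graphics as follows. The plotting library, the jitter amount, the color and the alpha value are all my assumptions here (the plot in this report may well have been drawn differently), and the data are the first ten rows from the str() output.

```r
# First ten rows from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

set.seed(1)  # make the random jitter reproducible

# Nudge the integer ratings sideways by at most 0.2 so points stop overlapping.
quality_jittered <- jitter(quality, amount = 0.2)

# Semi-transparent point color; alpha.f = 0.3 is an arbitrary choice.
point_col <- adjustcolor("darkred", alpha.f = 0.3)

plot(quality_jittered, volatile_acidity,
     pch = 16, col = point_col,
     xlab = "quality", ylab = "volatile.acidity")
```

Only the x positions are jittered, so the volatile acidity values on the y axis stay exact.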
The trend is very clear: the lower the volatile acidity level, the higher the wine quality. This does make sense, since too much volatile acidity can lead to an unpleasant, vinegar-like taste.
Now let’s look at fixed acidity, which has a much weaker correlation with quality (0.12).
As expected, the relationship is not as obvious as it is between volatile acidity and quality. How about TA (total acid), the combination of fixed acidity and volatile acidity?
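A minimal sketch of that combined variable, assuming “combination” means the plain sum of the two columns. The column name total.acidity is hypothetical, and the data frame holds only the first ten rows from the str() output.

```r
# First ten rows from the str() output above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# total.acidity is a hypothetical column name: fixed plus volatile acidity.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

# Correlation of the combined variable with quality (ten rows only).
r_ta <- cor(wine$total.acidity, wine$quality)
```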
Well, maybe there is a trend, but it is still not as clear as with volatile acidity. That is no surprise, since the taste of wine is much more complex and different types of acid affect how we perceive it. For example, during the ageing of Chardonnay, malic acid is gradually converted to lactic acid, and the sharp acidic taste becomes smoother.
The feature with the third strongest correlation with quality is sulphates (0.25). This coefficient is not very meaningful, but let’s have a look first.
Here again I added jitter and some transparency to prevent overplotting. There does appear to be a trend toward higher sulphate levels in higher rated wines. But there are also a large number of outliers for the wines rated as 5 or 6.
There is a long tail! Maybe we should try to take a log.
It’s much better. Let’s take a look at the correlation.
## cor
## 0.3086419
It is higher than the previous 0.25, which makes the variable more meaningful for wine quality.
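The comparison can be sketched like this, again on the first ten rows from the str() output only, so the coefficients will differ from 0.25 and 0.309. Which logarithm base was used is not stated; I assume it does not matter, because changing the base only rescales the values by a positive constant and leaves the correlation unchanged.

```r
# First ten rows from the str() output above.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Correlation on the raw scale and after a log transform.
r_raw <- cor(sulphates, quality)
r_log <- cor(log10(sulphates), quality)

# The natural log gives the same coefficient as log10.
r_ln <- cor(log(sulphates), quality)
c(raw = r_raw, log10 = r_log, ln = r_ln)
```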
Now let’s look at citric acid and quality, which have a correlation coefficient of 0.23. That is not ideal either.
There is a large amount of variance in these values, but I can see a positive trend: the median citric acid value increases steadily with each successive quality rating, from 0.035 g/dm³ for wines rated as 3 up to 0.420 g/dm³ for wines rated as 8.
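Per-rating medians like these can be computed with tapply(). The sketch uses the first ten rows from the str() output, which only cover ratings 5, 6 and 7, so the values are not the ones quoted above.

```r
# First ten rows from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
quality     <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median citric acid within each quality rating.
medians <- tapply(citric_acid, quality, median)
medians
```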
We see that many wines have a low citric acid concentration, including highly rated ones. This is consistent with our earlier finding that 8% of the wines contain no citric acid at all. In contrast to volatile acidity, citric acid adds freshness to a wine, but I think it is not a necessary feature of a quality wine.
Here, we’ll take a look at pH, which has the weakest correlation with quality (0.028).
Does this mean the pH level is meaningless for wine quality?
I don’t think so. With an appropriate pH level, the wine shows a better color and bacterial growth is kept under control; together with TA (total acid), pH gives a first indication of the taste and style of a wine. This feature is so important that every winemaker pays attention to it.
Also, our samples contain far more average wines than excellent or poor ones. We can see from the plot that most wines have a pH level between 3.2 and 3.4, which is already an appropriate range for red wines.
Finally, I’d like to look at quality and residual sugar plotted against each other. They have the second weakest correlation (0.031).
Wow, it has such a small amount of variance! But it does make sense. Based on sweetness, wine can be categorised into several types: dry, medium, sweet and so on. Each type of wine can be good or bad, so this variable does not seem to be a useful measure of the quality of a wine.
The following 4 combinations have the strongest overall correlations in the dataset.
Some correlations are positive, some are negative. For me, these are all reasonable relationships.
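A ranking of variable pairs by absolute correlation can be sketched as below. This is not necessarily how the combinations above were found; it uses five columns and the first ten rows from the str() output, and the names cor_mat, idx and ranked are my own.

```r
# Five columns, first ten rows each, from the str() output above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Full correlation matrix (Pearson by default).
cor_mat <- cor(wine)

# Keep each pair once by taking the upper triangle without the diagonal.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute value so strong negative pairs rank alongside positive ones.
ranked <- ranked[order(abs(ranked$r), decreasing = TRUE), ]
head(ranked, 4)
```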
For the main feature of interest in the dataset, quality has relatively strong correlations with 3 of the features: alcohol, volatile.acidity and log(sulphates).
alcohol has the strongest correlation with red wine quality (0.476). It shows a clear and positive correlation between the two in the plots. Other than a slight dip for wines rated as a 5, the median values of alcohol steadily increased with each rating.
volatile.acidity has a negative correlation with red wine quality (-0.391). The variance decreased with each increase in rating.
Like alcohol, sulphates has a positive correlation with quality (0.251). But there are also a large number of outliers for the wines rated as 5 or 6. Applying a log transformation increases the correlation coefficient to 0.309.
fixed.acidity has relatively strong relationships with several features, like pH, citric.acid and density.
The strongest relationship is easy to guess: pH and fixed.acidity.
Now let’s look at the two variables with the strongest correlations with quality plotted against each other and colored by quality.
From this plot we see that, in general, wines with higher alcohol content and lower volatile acidity tend to be better wines.
Next, we’ll create a similar plot to examine volatile acidity and sulphates, colored by quality.
We see that more sulphates combined with lower volatile acidity tend to produce better wines. Compared with low and medium quality wines (rated as 3 to 6), this trend is less obvious in high quality wines (rated as 7 or 8).
I think the trend in this plot is clearer than in the previous two. Higher alcohol content combined with a higher sulphates concentration tends to produce higher quality wines.
Let’s have a look at the combination of pH, fixed.acidity and citric.acid. They account for the 3 strongest correlations among all features.
This is a much more typical linear relationship. The trend is clear: the lower the pH level, the higher the fixed acidity concentration and the higher the citric acid.
Most of the relationships from this part of the analysis are consistent with what is seen in the earlier sections.
It looks like a very low sulphates concentration almost completely prevents a wine from achieving a high quality rating. On the other hand, there are some highly rated wines with very low alcohol content, and even with slightly high volatile acidity.
I didn’t, because none of the relationships seems strong enough to justify creating a model.
Alcohol has the strongest correlation with quality (0.476). As the alcohol content increases, the quality of the wine tends to increase as well. The wines rated as 3 all have an alcohol level of 11% or less, while roughly 75% of the high quality wines (rated as 7 or 8) have an alcohol level greater than 11%.
Volatile acidity has the second strongest correlation with red wine quality, and it is negative. Together with alcohol, I think this plot shows how the combination produces a higher quality wine: higher alcohol content with lower volatile acidity concentration tends to do so.
Quality and the three most influential features are visualized together in one plot. The six facets indicate the quality rating (from 3 to 8).
We see that as the quality rating increases, the alcohol content (y axis) tends to be higher, the volatile acidity concentration (x axis) tends to be lower, and the sulphates concentration (color scale) tends to be higher.
The red wine dataset contains 1,599 observations with 11 variables on chemical properties, and it was provided in a clean format without any missing data. My goal was to find out which chemical properties influence the quality of red wines.
I started by examining each of the features to get a feel for the distributions and ranges of values. Using correlation coefficients, I picked out three features and plotted them against quality. Finally, by putting the three features together with quality in one plot, I was able to infer that they can influence the quality of red wines. These three features are alcohol, volatile acidity and sulphates.
It is important to note, however, that there is a high concentration of wines in the middle of the rating range; that is, our samples contain far more average wines than excellent or poor ones. We cannot see the whole picture.
I also think the dataset is quite limited with its 11 chemical properties; it would be great if other variables such as grape type and wine age could be included for further investigation.